\( \newcommand{\water}{{\rm H_{2}O}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\E}{\mathbb{E}} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\grad}{\nabla} \newcommand{\T}{^\text{T}} \newcommand{\mathbbone}{\unicode{x1D7D9}} \renewcommand{\:}{\enspace} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\Tr}{Tr} \newcommand{\norm}[1]{\lVert #1\rVert} \newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) } \newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.} \)

Latent neural process (LNP): Derivation of an ELBO-like objective

Latent neural processes (LNPs) [1] use a training objective that is inspired by the ELBO. The derivation proceeds in the following steps:

Usual ELBO

Suppose we have a dataset \(\; D = \left\{ (x_i, y_i) \right\}_{i = 1}^{n} \;\) that has been generated from a latent variable \(\; z \;\) by a model \(\; p_\theta(y_D \mid z, x_D) \;\). We could estimate the parameters \(\; \theta \;\) by maximum likelihood, i.e. by maximizing the marginal likelihood of \(\; y_D = \left\{ y_i \right\}_{i=1}^{n} \;\). The log marginal likelihood can be decomposed as

\[ \log p(y_D \mid x_D) \:=\: \KL{q(z \mid x_D, y_D)}{p(z \mid x_D, y_D)} \:+\: \mathbb{E}_q \left[ \log p(y_D \mid x_D, z) \right] + \mathbb{E}_q \left[ \log \frac{p(z)}{q(z \mid x_D, y_D)} \right], \]

which results in the usual ELBO bound

\[ \log p(y_D \mid x_D) \:\geq\: \mathbb{E}_q \left[ \log p(y_D \mid x_D, z) \right] + \mathbb{E}_q \left[ \log \frac{p(z)}{q(z \mid x_D, y_D)} \right]. \]

Note that this is equivalent to the ELBO bound of a VAE for a single datapoint \(\; \mathbf{y} := y_D \;\) originating from a latent variable \(\; z \;\), with the only difference that here we are conditioning everything on the inputs \(\; x_D \;\) too (by design).
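For completeness, the decomposition above follows from Bayes' rule, \(\; p(z \mid x_D, y_D) = \frac{p(y_D \mid x_D, z)\, p(z)}{p(y_D \mid x_D)} \;\) (with the prior over \(\; z \;\) independent of the inputs), by taking an expectation under \(\; q(z \mid x_D, y_D) \;\):

\begin{align} \log p(y_D \mid x_D) &\;=\; \mathbb{E}_q \left[ \log \frac{p(y_D \mid x_D, z) \, p(z)}{p(z \mid x_D, y_D)} \right] \\[10pt] &\;=\; \mathbb{E}_q \left[ \log \frac{q(z \mid x_D, y_D)}{p(z \mid x_D, y_D)} \right] \;+\; \mathbb{E}_q \left[ \log p(y_D \mid x_D, z) \right] \;+\; \mathbb{E}_q \left[ \log \frac{p(z)}{q(z \mid x_D, y_D)} \right], \end{align}

where the first term is exactly \(\; \KL{q(z \mid x_D, y_D)}{p(z \mid x_D, y_D)} \;\), and the bound follows because this KL divergence is nonnegative.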

Conditional ELBO (intractable)

Now, suppose that we partition the dataset into a context set \(\; C = \left\{ (x_i, y_i) \right\}_{i=1}^{m} \;\) and a target set \(\; T = \left\{ (x_i, y_i) \right\}_{i=m+1}^{n} \;\). The goal is to infer \(\; y_T \;\) given the information in \(\; y_C \;\). We could try to obtain appropriate parameters \(\; \theta \;\) by maximizing the same marginal likelihood as before, \(\; p(y_D \mid x_D) \;\). However, this is an indirect objective, since it amounts to maximizing the likelihood of the entire dataset. What we really want to maximize is the conditional marginal likelihood \(\; p(y_T \mid x_D, y_C) \;\). Following the same VAE analogy as before, this is equivalent to reconstructing part of a datapoint based on the rest of the datapoint (for example, reconstructing the left half of an image based on the right half).
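As an aside, in practice the context/target partition is usually not fixed but resampled at every training step, so that the model learns to condition on context sets of varying size. Below is a minimal sketch of such a split in Python; the function name `split_context_target` and the toy data are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def split_context_target(x, y, num_context, rng):
    """Randomly split one sampled function (x, y) into context and target sets.

    x, y: arrays of shape (n, d_x) and (n, d_y) for a single dataset D.
    num_context: number of points m to use as context (m < n).
    Returns (x_C, y_C) and (x_T, y_T); as in the derivation above, the target
    set is the complement of the context set.
    """
    n = x.shape[0]
    perm = rng.permutation(n)           # random ordering of the n points
    context_idx = perm[:num_context]    # first m indices -> context set C
    target_idx = perm[num_context:]     # remaining indices -> target set T
    return (x[context_idx], y[context_idx]), (x[target_idx], y[target_idx])

# Example: a toy 1D function with 50 points, 10 of which form the context.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)[:, None]
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)
(x_C, y_C), (x_T, y_T) = split_context_target(x, y, num_context=10, rng=rng)
```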

We can obtain an ELBO bound for this conditional marginal likelihood as follows. The LHS of the usual ELBO can be rewritten as

\[ \log p(y_D \mid x_D) \;=\; \log \big(\, p(y_T \mid x_D, y_C) \, \, p(y_C \mid x_D) \,\big) \;=\; \log p(y_T \mid x_D, y_C) \;+ \; \log p(y_C \mid x_D). \]

Similarly, using the conditional independence of the \(\; y_i \;\) given \(\; z \;\) (so that \(\; p(y_D \mid z, x_D) = p(y_T \mid z, x_D) \, p(y_C \mid z, x_D) \;\)), and multiplying and dividing by \(\; p(z \mid x_D, y_C) \;\), the RHS can be rewritten as

\[ \mathbb{E}_q \Big[ \log \big( \, p(y_T \mid z, x_D) \,\,{\color{red} p(y_C \mid z, x_D)}\, \big) \Big] \; + \; \mathbb{E}_q \left[ \log \frac{{\color{red} p(z)} \, \, p(z \mid x_D, y_C)}{q(z \mid x_D, y_D) \, \, {\color{red} p(z \mid x_D, y_C)}} \right]. \]

We can obtain the desired ELBO by subtracting \(\; \log p(y_C \mid x_D) \;\) from both sides and \(\; \color{red} \text{collecting the terms in red} \;\) on the RHS. After the subtraction, the LHS becomes the desired conditional log marginal likelihood, \(\; \log p(y_T \mid x_D, y_C) \;\). The RHS becomes

\[ \mathbb{E}_q \left[ \log p(y_T \mid z, x_D) \right] \;+\; \mathbb{E}_q \left[ \log \frac{p(z \mid x_D, y_C)}{q(z \mid x_D, y_D)} \right] \;+\; \mathbb{E}_q \left[ \log \frac{{\color{red} p(y_C \mid z, x_D) \,p(z)}}{\color{red} p(z \mid x_D, y_C)} - \log p(y_C \mid x_D) \right], \]

where the last term vanishes (using the fact that, by design, the prior over \(\; z \;\) does not depend on the inputs, so \(\; p(z) = p(z \mid x_D) \;\)):

\begin{align} \log \frac{p(y_C \mid z, x_D) \, p(z)}{p(z \mid x_D, y_C)} - \log p(y_C \mid x_D) & \;=\; \log \frac{p(y_C, z \mid x_D)}{p(z \mid x_D, y_C) \, p(y_C \mid x_D)} \\[10pt] & \;= \; \log \frac{p(y_C, z \mid x_D)}{p(y_C, z \mid x_D)} = \log 1 = 0. \end{align}

Thus, we arrive at

\[ \log p(y_T \mid x_D, y_C) \:\geq\: \mathbb{E}_q \left[ \log p(y_T \mid z, x_D) \right] \;+\; \mathbb{E}_q \left[ \log \frac{p(z \mid x_D, y_C)}{q(z \mid x_D, y_D)} \right], \]

which becomes the desired bound after making the (reasonable) assumption that each \(\; y_i \;\) can only be informed by \(\; z \;\) and its corresponding input \(\; x_i \;\), so that \(\; p(y_T \mid z, x_D) \;\) becomes \(\; p(y_T \mid z, x_T) \;\) and \(\; p(z \mid x_D, y_C) \;\) becomes \(\; p(z \mid x_C, y_C) \;\):

\begin{align} \log p(y_T \mid x_D, y_C) \:&\geq\: \mathbb{E}_q \left[ \log p(y_T \mid z, x_T) \right] \;+\; \mathbb{E}_q \left[ \log \frac{p(z \mid x_C, y_C)}{q(z \mid x_D, y_D)} \right] \\[10pt] &=\: \mathbb{E}_q \left[ \log p(y_T \mid z, x_T) \right] \;-\; \KL{q(z \mid x_D, y_D)}{p(z \mid x_C, y_C)}. \end{align}

We would be done except for one problem: this expression is unfortunately intractable, because the posterior \(\; p(z \mid x_C, y_C ) = \frac{p(y_C \mid x_C, z) \, p(z)}{\int p(y_C \mid x_C, z) \, p(z) \, \d z} \;\) involves an intractable integral.

Note that, since we have subtracted the same term from both the log marginal likelihood and the ELBO, the gap between them is unchanged, i.e. the overall KL divergence is still

\[ \KL{q(z \mid x_D, y_D)}{p(z \mid x_D, y_D)}. \]
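Written out, this gives the exact decomposition

\[ \log p(y_T \mid x_D, y_C) \;=\; \KL{q(z \mid x_D, y_D)}{p(z \mid x_D, y_D)} \;+\; \mathbb{E}_q \left[ \log p(y_T \mid z, x_D) \right] \;+\; \mathbb{E}_q \left[ \log \frac{p(z \mid x_D, y_C)}{q(z \mid x_D, y_D)} \right], \]

so the bound is tight exactly when \(\; q(z \mid x_D, y_D) \;\) matches the full posterior \(\; p(z \mid x_D, y_D) \;\).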

Conditional ELBO-like (tractable)

The previous ELBO is a bound on the desired conditional log marginal likelihood \(\; \log p(y_T \mid x_D, y_C) \;\), but it is intractable because \(\; p(z \mid x_C, y_C) \;\) is. LNPs circumvent this issue by approximating this term with \(\; q(z \mid x_C, y_C) \;\):

\[ \log p(y_T \mid x_D, y_C) \:\geq\: \mathbb{E}_q \left[ \log p(y_T \mid z, x_T) \right] \;-\; \KL{q(z \mid x_D, y_D)}{q(z \mid x_C, y_C)}. \]

The right-hand term (the KL or regularization term) can now be interpreted in the following way: the variational distribution over the latent variable \(\; z \;\) should be similar when the model has access to full information about the function (\(\; x_D, y_D \;\)) and when it only has partial information about the function (\(\; x_C, y_C \;\)). This seems a reasonable objective for NPs, which try to recover the whole \(\; y_D \;\) based only on \(\; x_C, y_C \;\).
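As an aside, in LNPs \(\; q \;\) is typically parameterized as a diagonal Gaussian whose mean and variance are produced by an encoder applied to the corresponding set of \(\; (x, y) \;\) pairs, in which case the KL term has a simple closed form. For diagonal Gaussians \(\; q_1 = \mathcal{N}(\mu_1, \operatorname{diag}(\sigma_1^2)) \;\) and \(\; q_2 = \mathcal{N}(\mu_2, \operatorname{diag}(\sigma_2^2)) \;\),

\[ \KL{q_1}{q_2} \;=\; \frac{1}{2} \sum_{k=1}^{\dim(z)} \left( \log \frac{\sigma_{2,k}^2}{\sigma_{1,k}^2} \;+\; \frac{\sigma_{1,k}^2 + (\mu_{1,k} - \mu_{2,k})^2}{\sigma_{2,k}^2} \;-\; 1 \right). \]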

Note, however, that this ELBO-like objective is no longer guaranteed to be a lower bound on the conditional log marginal likelihood \(\; \log p(y_T \mid x_D, y_C) \;\), so we lose the guarantee that we are maximizing (a lower bound on) the likelihood of the parameters.
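To make the final objective concrete, below is a minimal PyTorch-style sketch of how one training step could evaluate it, using a single Monte Carlo sample of \(\; z \;\) and diagonal Gaussian parameterizations for both \(\; q \;\) and the decoder likelihood. All names (`SetEncoder`, `Decoder`, `np_loss`) and architectural details are illustrative assumptions for this sketch, not the implementation from the NP paper.

```python
import torch
from torch import nn
from torch.distributions import Normal, kl_divergence

class SetEncoder(nn.Module):
    """Maps a set {(x_i, y_i)} to a diagonal Gaussian q(z | x, y).

    Deep-sets style: embed each pair, average over the set, then predict
    the mean and (softplus-transformed) std of z.
    """
    def __init__(self, x_dim=1, y_dim=1, h_dim=64, z_dim=32):
        super().__init__()
        self.point_net = nn.Sequential(
            nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, h_dim))
        self.to_mu = nn.Linear(h_dim, z_dim)
        self.to_sigma = nn.Linear(h_dim, z_dim)

    def forward(self, x, y):
        r = self.point_net(torch.cat([x, y], dim=-1)).mean(dim=0)  # aggregate over the set
        return Normal(self.to_mu(r), nn.functional.softplus(self.to_sigma(r)) + 1e-4)

class Decoder(nn.Module):
    """Maps (z, x_i) to a Gaussian over y_i (factorized across target points)."""
    def __init__(self, x_dim=1, y_dim=1, h_dim=64, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * y_dim))

    def forward(self, z, x):
        out = self.net(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))
        mu, log_sigma = out.chunk(2, dim=-1)
        return Normal(mu, nn.functional.softplus(log_sigma) + 1e-4)

def np_loss(encoder, decoder, x_C, y_C, x_T, y_T):
    """Negative ELBO-like objective for one context/target split."""
    q_full = encoder(torch.cat([x_C, x_T]), torch.cat([y_C, y_T]))  # q(z | x_D, y_D)
    q_context = encoder(x_C, y_C)                                   # q(z | x_C, y_C)
    z = q_full.rsample()                                            # one reparameterized sample of z
    log_lik = decoder(z, x_T).log_prob(y_T).sum()                   # E_q[log p(y_T | z, x_T)] (1-sample MC)
    kl = kl_divergence(q_full, q_context).sum()                     # KL(q(z|x_D,y_D) || q(z|x_C,y_C))
    return -(log_lik - kl)
```

A training loop would then repeatedly sample a function from the data, draw a random context/target split (as sketched earlier), and take a gradient step on `np_loss`.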


References

[1] Garnelo et al. 2018. Neural Processes.

[2] Garnelo et al. 2018. Conditional Neural Processes.